I took Professor Geoffrey Hinton's neural networks class, starting on 2017-05-04. I'm summarizing the lecture slides and my own notes from the lectures, and supplementing them with other information I can find online.
Activation follows a linear function:
$y = b + \sum{x_i w_i}$ where $b$ is the bias
Activation follows sigmoid function:
$z = b + \sum{x_i * w_i}$
$y = \frac{1}{1 + e^{-z}}$
Activation is binary: on or off. The bias $b$ can be negative, acting as a threshold:
$z = b + \sum{x_i * w_i}$
$y = \begin{cases} 1 & \quad \text{if } z \geq 0\\ 0 & \quad \text{otherwise} \\ \end{cases} $
Activation has threshold and linear function beyond the threshold:
$z = b + \sum{x_i * w_i}$ (a negative bias raises the threshold at which the unit turns on)
$y = \begin{cases} z & \quad \text{if } z > 0\\ 0 & \quad \text{otherwise} \\ \end{cases} $
Activation is the probability of the unit producing a spike (a stochastic binary neuron):
$p(s=1) = \frac{1}{1 + e^{-z}}$
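A minimal numpy sketch of the activation functions above (the function names and single-neuron framing are mine):

```python
import numpy as np

def linear(x, w, b):
    return b + np.dot(x, w)                  # y = b + sum_i x_i w_i

def sigmoid(x, w, b):
    z = b + np.dot(x, w)
    return 1.0 / (1.0 + np.exp(-z))          # y = 1 / (1 + e^-z)

def binary_threshold(x, w, b):
    z = b + np.dot(x, w)
    return 1 if z >= 0 else 0                # on iff z >= 0

def relu(x, w, b):
    z = b + np.dot(x, w)
    return z if z > 0 else 0.0               # linear beyond the threshold

def stochastic_binary(x, w, b, rng=np.random.default_rng()):
    p = sigmoid(x, w, b)                     # treat the sigmoid output as p(s=1)
    return 1 if rng.random() < p else 0
```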
Meant to help generalize the model to future data. Usually used on the cross-validation set.
Overfitting can also be avoided by:
In mini-batch SGD, turn down the learning rate towards the end of learning. It's slower learning but it helps minimize the error.
Speed up mini-batch with:
Calculate a loss function while moving towards a global/local minimum. Run some data through the network, calculate the loss, update the weights, and repeat. Each round of calculating the loss and updating the weights should reduce the overall error.
To improve SGD:
Given a set of hyperparameters, create a cartesian product of the params and run a model with each member of the cartesian product.
Ex:
from itertools import product

param_a = [1, 2, 3]
param_b = ['a', 'b', 'c']
grid = product(param_a, param_b)   # (1,'a'), (1,'b'), ..., (3,'c')
for member in grid:
    run_model(member)              # assumes run_model is defined elsewhere
Calculates a squared-error loss on a continuous (linear) output.
NOTE: target = the true class label
Loss Function:
$J(w) = \frac{1}{2} \sum{ (target^i - output^i)^2 }$
Calculating Change in Weights:
$\Delta w_j = - \epsilon \frac{\partial J}{\partial w_j}$ where $\epsilon$ is the learning rate and $\frac{\partial J}{\partial w_j}$ is the partial derivative of the loss function with respect to weight $w_j$
$\Delta w_j = \epsilon \sum{(t^i - o^i)x_j^i}$
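A sketch of that batch update in numpy, under the definitions above (X holds one training case per row; the names are mine):

```python
import numpy as np

def delta_rule_step(X, t, w, epsilon):
    o = X @ w                            # linear outputs, one per training case
    return w + epsilon * X.T @ (t - o)   # delta w_j = eps * sum_i (t^i - o^i) x_j^i
```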
Stanford's lecture notes: I find those more helpful than the lecture slides for this section.
Calculates logistic loss on a binary output, penalizing very wrong outputs very strongly and not-so-wrong outputs not so strongly.
Likelihood (we maximize this):
$J(w) = \displaystyle\prod_{i=1}^{m} p(y^i \mid x^i; w)$
We want to maximize the log likelihood. This is also known as the cross entropy function.
NOTE: $h(x) = \frac{1}{1 + e^{-z}}$
$\log J(w) = \displaystyle\sum_{i=1}^{m}{y^i \log h(x^i) + (1 - y^i) \log(1 - h(x^i))} $
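A numpy sketch of that log likelihood (h holds the sigmoid outputs $h(x^i)$ and y the binary labels; the names are mine):

```python
import numpy as np

def log_likelihood(y, h):
    # sum_i [ y_i * log h_i + (1 - y_i) * log(1 - h_i) ]; training maximizes this
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```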
Used for classification of K number of classes. All values of the output sum to one. All outputs represent a probability distribution across discrete alternatives.
$ P(y = j \mid x) = \displaystyle\frac{e^{z_j}}{\sum_{k=1}^{K}{e^{z_k}}} $
$ J(w) = -\displaystyle\sum_j{t_j \log y_j} $
NOTE: lectures didn't show how to change weights
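The slides don't give the update, but for softmax with the cross-entropy loss above, the gradient with respect to the logits works out to $\frac{\partial J}{\partial z_i} = y_i - t_i$ (a standard result, not from the slides). A sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # shift by max(z) for numerical stability
    return e / e.sum()           # all outputs sum to one

def cross_entropy_grad(z, t):
    return softmax(z) - t        # dJ/dz_i = y_i - t_i for a one-hot target t
```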
... damps oscillations in directions of high curvature by combining gradients with opposite signs
... builds up speed in directions with a gentle but consistent gradient
$v(t) = \alpha v(t-1) - \epsilon \displaystyle\frac{\partial E}{\partial w}(t)$
!! very important !! First make a big jump in the direction of the previous accumulated gradient. Then measure the gradient where you end up and make a correction. The insight (from Nesterov, 1983) is that it's better to correct a mistake after it's been made. It yields better learning.
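A sketch of plain momentum vs. the Nesterov version (grad is an assumed callable returning $\frac{\partial E}{\partial w}$; alpha is the momentum, epsilon the learning rate):

```python
def momentum_step(w, v, grad, alpha, epsilon):
    v = alpha * v - epsilon * grad(w)              # v(t) = alpha v(t-1) - eps dE/dw
    return w + v, v

def nesterov_step(w, v, grad, alpha, epsilon):
    v = alpha * v - epsilon * grad(w + alpha * v)  # jump first, then measure and correct
    return w + v, v
```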
Use only the sign of the gradient
rprop, a full-batch version:
In full-batch learning, keep a separate step size for each weight and use only the sign of the gradient. This has the advantage of escaping quickly from plateaus with tiny gradients. Increase the step size for a weight multiplicatively iff the last two updates have the same sign (indicating we're going downhill consistently); else, decrease the step size multiplicatively (lecture suggests x 0.5).
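A per-weight numpy sketch of that rule (the 1.2 increase factor is my assumption of a typical value; the 0.5 decrease is from the notes above):

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, up=1.2, down=0.5):
    agree = grad * prev_grad > 0                    # last two gradients had the same sign
    step = np.where(agree, step * up, step * down)  # adapt each weight's own step size
    return w - np.sign(grad) * step, step           # use only the sign of the gradient
```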
RMSprop keeps a moving average of the squared gradient for each weight:
$MeanSquare(w, t) = 0.9 MeanSquare(w, t-1) + 0.1 (\frac{\partial E}{\partial w} (t))^2$
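Per the lecture, the weight update then divides the gradient by $\sqrt{MeanSquare(w, t)}$. A sketch:

```python
import numpy as np

def rmsprop_step(w, grad, mean_sq, epsilon):
    mean_sq = 0.9 * mean_sq + 0.1 * grad ** 2   # moving average of the squared gradient
    return w - epsilon * grad / (np.sqrt(mean_sq) + 1e-8), mean_sq
```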
Wide (shallow) directions vs. steep directions. In a wide direction, we want to make a big jump. In a steep direction, we want to move towards the bottom without overshooting. Multiply by the inverse of the curvature matrix. Use conjugate gradient.
Conjugate means that as you go in the new direction, you do not change the gradients in the previous directions
Conjugate gradient is guaranteed to find the minimum of an N-dim quadratic surface because calculating the gradient of the quadratic surface moves towards a global minimum. Hand-wavey math guarantees we always move towards a better place in the next step. The steps are "conjugate" (better terminology is coupled IMHO) and coefficients are calculated to maximize the step size towards the minimum error.
Good explanation: brush up on quadratic approximations @ Khan Academy.
SGD is a first-order optimization method: it only fits a linear local approximation. HF (and conjugate gradient) are second-order methods, which supply the means to deal with curvature while seeking a minimum across the whole surface. Surfaces are curved and planes are flat. We need surface math.
In Conjugate Descent, $\alpha$ can define the step size and $\beta$ can define the direction. $\alpha$ is calculated using a line-search algo and $\beta$ is a scalar value respecting the direction of the last step (conjugate)
Predicts the next output given a sequence of previous outputs. Uses "delay taps." Reminds me of naively predicting the weather: tomorrow it will rain because it has rained the last two days in a row.
Generalize autoregressive models by using one or more layers of non-linear hidden units.
Prevent overfitting by limiting the # of hidden layers.
Combining models helps avoid overfitting and can easily extend beyond NNs.
Find models that have different biases and boost weights in different areas to enhance accuracy of overall prediction. "Mixture of experts" can model particular subsets of the data; global vs. local models.
Use dropout to improve generalization
Instead of trying to find the best single setting of the parameters (as in Maximum Likelihood or MAP), compute the full posterior distribution over all possible parameter settings.
In cases where there is little data, developing posterior estimates is expensive but helpful to avoid overfitting. Using grid search with 6 weights and 9 discrete values produces 9^6 grid-points. Lots of computation but helps direct the modeling and may improve generalization.
Amazing Fact: if we use just the right amount of noise, and if we let the weight vector wander around for long enough before we take a sample, we will get an unbiased sample from the true posterior over the weight vectors.
This is Markov Chain Monte Carlo, and it makes it possible to use full Bayesian learning with thousands of params. Learning involves allowing the models to "wander around" the weight space and explore minima (wrt the cost function). As we encourage learning to trend towards minima, but still allow exploration, we observe samples that find optimal weights.
A "surprising shortcut" for learning in models that would normally maximize the log likelihood. Does not follow a gradient but works well in practice. Useful in Boltzmann and RBMs (generative binary architectures).
$\Delta w_{ij} = \epsilon ( <v_i h_j>^0 - <v_i h_j>^1 )$
Starting at the data, the MCMC begins to favor configurations with lower energy. We can observe the direction of this favoring and interrupt learning before it completely reaches equilibrium. Interrupting is okay because we know the weights are "bad" (the weights are updating away from predicting the data). Learning happens by lowering the probability of configurations that do not result in the known visible states and raising the probability of the configurations that do result in known visible states.
Learning cancels out once the confabulations and the data have the same distribution
In practice, beginning with small weights and CD1 (1 = one full step) is quick and mostly correct. As weights grow, the effects of the MCMC are less pronounced and we can use CD3. As weights continue to grow, move up to CD10.
CD learning rule for a binary unit is the same as for a softmax.
Wikipedia has a good explanation too.
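A minimal numpy sketch of CD1 for a binary RBM, following the update rule above (biases omitted for brevity; the names are mine):

```python
import numpy as np

rng = np.random.default_rng()
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, epsilon):
    h0_p = sig(v0 @ W)                           # p(h=1 | data vector v0)
    h0 = (rng.random(h0_p.shape) < h0_p) * 1.0   # sample binary hidden states
    v1 = sig(h0 @ W.T)                           # one-step reconstruction ("confabulation")
    h1_p = sig(v1 @ W)                           # p(h=1 | reconstruction)
    # delta w_ij = eps * ( <v_i h_j>^0 - <v_i h_j>^1 )
    return W + epsilon * (np.outer(v0, h0_p) - np.outer(v1, h1_p))
```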
It's hard to learn a complicated model like a sigmoid belief network because calculating the posterior distribution over all hidden configurations is nearly intractable.
Key thought: Do inference wrong and hope that learning still works. The wrong part is to assume that the posterior over all hidden configurations factorizes into a product of distributions for each hidden unit.
For three hidden units with probability = (0.3, 0.6, 0.8)
$p(1,0,1) = 0.3 * (1 - 0.6) * 0.8$
Wake Phase: Recognition weights to reconstruct activities in each layer (moving from input layer into the network)
Sleep Phase: Generative weights to reconstruct activities in each layer, moving from the hidden layers into the input (reconstructs the input)
Flawed because:
Mode averaging - inputs with modes of (0,1) or (1,0) will result in recognition weights learning (0.5) and (0.5). A better solution would be to just pick one of (0,1) or (1,0).
For small datasets, or large datasets without much redundancy, use the full-batch method.
For big, redundant datasets, use mini-batch.
NOTE: this is a verbatim snapshot of a slide from lecture 13. I think it helps describe the practicality or feasibility of employing certain statistical techniques versus more complicated ML techniques
A very simple network architecture. Features are not learned, they're designed, and weights are learned.
Guaranteed to work (if the classes are linearly separable):
for x_i, t_i in zip(X, t):            # one pass over the training examples
    y = 1 if w @ x_i >= 0 else 0
    if y == t_i: continue             # output correct: don't change weights
    elif y == 0: w = w + x_i          # incorrectly output 0: add input vector to weights
    else:        w = w - x_i          # incorrectly output 1: subtract input vector from weights
$\partial L_{\pmb{w}}(y^{(i)}) = \begin{cases} \{ 0 \} & \quad \text{if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} > 0\\ \{ -y^{(i)} \pmb{x}^{(i)} \} & \quad \text{if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} < 0\\ [-1, 0] \times y^{(i)} \pmb{x}^{(i)} & \quad \text{if } \pmb{w}^\top\pmb{x}^{(i)} = 0 \\ \end{cases} $
Weights from multiple models can be averaged and produce another valid model
Can find patterns, but not patterns that "wrap-around"
Generic structure of a NN. Many special-case instances follow.
Good for processing sequences of data, e.g. speech and image recognition. Uses internal memory.
We cannot know the hidden states directly. We can only know a probability distribution over the hidden state space.
This is backprop assuming SGD:
Inputs are multiplied by weights into a hidden neuron. Hidden neurons are then multiplied by separate weights to produce either more hidden neurons or output neurons. Forward pass complete. Error is computed and then the weights in each layer (input => hidden, hidden => output) are updated.
Lectures recommend a forward pass using squashing functions (like the logistic) to prevent activity vectors from exploding. The backward pass is linear. The forward pass determines the slope of the squashing function at each neuron, which is used when backpropagating through it.
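A toy sketch of one forward/backward pass for a single logistic hidden layer with squared error (my own construction, not from the lectures):

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, epsilon):
    h = sig(x @ W1)                      # forward: input -> hidden (squashed)
    y = sig(h @ W2)                      # forward: hidden -> output
    dy = (y - t) * y * (1 - y)           # backward: error times logistic slope at output
    dh = (dy @ W2.T) * h * (1 - h)       # backward: propagate through the hidden slope
    W2 -= epsilon * np.outer(h, dy)      # update hidden -> output weights
    W1 -= epsilon * np.outer(x, dh)      # update input -> hidden weights
    return W1, W2
```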
4 Effective ways to learn an RNN:
1) LSTM 2) Hessian-Free Optimization 3) Echo State Networks, a means of initializing the connections very carefully so that the hidden state has a reservoir of weakly coupled oscillators 4) Good initialization w/ momentum
Add a penalty for changing any of the hidden activities too much (noted in HF)
Characterized by a learning procedure of repeatedly extracting features (convolutions) and pooling those features (subsampling). This process reduces the number of total features learned, thus getting around the curse of dimensionality. It also helps generalize the model.
Using McNemar's test can help diagnose the severity of errors and help understand if the network is improving with different conditions.
An implementation of an RNN with a read gate, a write gate and a keep gate.
A reservoir of hidden units whose weights are set randomly and then fixed; only the last layer (usually a linear model) is learned.
Fix the input -> hidden and hidden -> hidden connections to random values. Only learn the hidden -> output layer. Use sparse connectivity. This creates a lot of loosely coupled oscillators. This was modeled in lecture with a sine wave.
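A toy sketch under those assumptions: fix random input -> hidden and hidden -> hidden weights, run the reservoir, and fit only the hidden -> output readout (the 0.1 scaling of the reservoir weights is my stand-in for the careful initialization the lecture describes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 100
W_in  = rng.normal(size=(n_res, n_in))          # fixed input -> hidden weights
W_res = rng.normal(size=(n_res, n_res)) * 0.1   # fixed hidden -> hidden weights

def run_reservoir(inputs):
    h, states = np.zeros(n_res), []
    for u in inputs:                             # u is a length-n_in array
        h = np.tanh(W_in @ u + W_res @ h)        # loosely coupled oscillators
        states.append(h.copy())
    return np.array(states)

# Learn only hidden -> output, e.g. by least squares against targets y:
#   W_out = np.linalg.lstsq(run_reservoir(inputs), y, rcond=None)[0]
```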
A special-case RNN characterized by binary threshold neurons without any hidden units. Guaranteed to converge to a local minimum, but will sometimes converge on a false pattern. John Hopfield realized that if connections are symmetric, there is a global energy function. Each energy contribution depends on one connection weight and the binary states of the two neurons it connects.
NOTE: $s$ = state, $E$ = energy
$ E = - \sum_i{s_i b_i} - \sum_{i<j}{s_i s_j w_{ij}} $
Quadratic energy function enables the calculation of each individual unit's effect on the global energy:
$ \text{Energy gap} = \Delta E_i = E(s_i = 0) - E(s_i = 1) = b_i + \sum_j{s_j w_{ij}} $
-E = goodness
With states = 1 and -1, $\Delta w_{ij} = s_i s_j$. With states 0 and 1, $\Delta w_{ij} = 4(s_i - \frac{1}{2})(s_j - \frac{1}{2})$
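A numpy sketch of the energy, a single unit's update, and the ±1 storage rule above (W symmetric with zero diagonal, so $\sum_{i<j}{s_i s_j w_{ij}} = \frac{1}{2} s^\top W s$; the names are mine):

```python
import numpy as np

def energy(s, W, b):
    # E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij
    return -s @ b - 0.5 * s @ W @ s

def update_unit(s, W, b, i):
    gap = b[i] + W[i] @ s          # Delta E_i = E(s_i=0) - E(s_i=1)
    s[i] = 1 if gap >= 0 else 0    # adopt whichever state gives lower energy
    return s

def store_pattern(W, s):
    W = W + np.outer(s, s)         # delta w_ij = s_i s_j for +-1 states
    np.fill_diagonal(W, 0)
    return W
```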
Spurious minima are a pervasive problem. Two local minima that are close to each other in space can combine into a deeper local minimum. Difficult to overcome. Unlearning can be very effective in overcoming this shortfall. "Breaking" connections improves the robustness of the net. Elizabeth Gardner suggested cycling through the training set many times, using the perceptron convergence procedure during training.
Given a training set of binary vectors, [a BM will] fit a model that will assign a probability to every possible binary vector
$p( Model_i \mid data) = \displaystyle\frac{p(data \mid Model_i)}{\sum_j{p(data \mid Model_j)}} $
Very similar to Hopfield NN:
"Boltzmann machines can be seen as the stochastic, generative couterpart of Hopfield nets." Wikipedia
Learning happens by picking a visible state (i.e. a binary vector) and summing over all possible hidden state probabilities that could produce that visible state:
The math of a causal generative model (Boltzmann Machines are not causal):
$p(\mathbf{v}) = \displaystyle\sum_h{p(h) \, p(\mathbf{v} \mid h) }$
The probability of the observed vector and hidden units is proportional to the exponential of the negative energy of the model:
$p( \mathbf{v}, \mathbf{h}) \propto e^{-E(\mathbf{v}, \mathbf{h})}$
Joint Configuration:
$-E(\mathbf{v}, \mathbf{h}) = \displaystyle\sum_{i \in vis}{v_i b_i} + \displaystyle\sum_{k \in hidden}{h_k b_k} + \displaystyle\sum_{i \lt j}{v_i v_j w_{ij}} + \displaystyle\sum_{i,k}{v_i h_k w_{ik} } + \displaystyle\sum_{k < l}{h_k h_l w_{kl} } $
where $v_i$ is the binary state of visible unit $i$ and $b_k$ is the bias of unit $k$
Taken directly from the slide:
The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations:
$ p( \mathbf{v}, \mathbf{h}) = \frac{\displaystyle\exp(-E(\mathbf{v}, \mathbf{h}))}{\displaystyle\sum_{u,g}{\exp(-E(\mathbf{u}, \mathbf{g}))}}$
The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:
$p(\mathbf{v}) = \frac{\displaystyle\sum_h{\exp(-E(\mathbf{v}, \mathbf{h}))}} {\displaystyle\sum_{u,g}{\exp(-E(\mathbf{u}, \mathbf{g}))}} $
NOTE: I think that the notation $\mathbf{v}$ indicates the notion that visible states are being held fixed and the hidden states are updating. $\mathbf{u, g}$ indicates that all nodes are being updated.
(this is the lecture where the table of all possible visible units, hidden units, Energy, exp(Energy) and probability was generated)
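A brute-force sketch that regenerates such a table for a tiny net: enumerate every global configuration, compute the energy, exponentiate, and normalize (energy_fn is an assumed callable implementing $E(\mathbf{v}, \mathbf{h})$ from the joint configuration formula above):

```python
import numpy as np
from itertools import product

def boltzmann_table(n_vis, n_hid, energy_fn):
    configs = [(np.array(bits[:n_vis]), np.array(bits[n_vis:]))
               for bits in product([0, 1], repeat=n_vis + n_hid)]
    energies = np.array([energy_fn(v, h) for v, h in configs])
    weights = np.exp(-energies)         # exp(-E) for every joint configuration
    probs = weights / weights.sum()     # normalizing term (partition function)
    return [(v, h, E, p) for (v, h), E, p in zip(configs, energies, probs)]
```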
Calculating the normalizing term is computationally infeasible, so instead we use MCMC to sample, starting from random global configurations. Continue to run the Markov chain until thermal equilibrium, at which point the probability of a global configuration is related to its energy by the Boltzmann distribution.
"Clamping" the visibile units (using training data) and allowing the hidden units to update stochastically helps the model learn the states of the best configuration of hidden units to explain the observed data.
Learning in network architecture involves maximizing the product of the probabilities that the machine assigns the binary vectors in the training set. It's equivalent to maximizing the probability that we could produce the N training cases (binary vectors) if we let the network settle to its equilibrated distribution.
$\Delta w_{ij} \propto \langle s_i s_j \rangle _{data} - \langle s_i s_j \rangle _{model}$
NOTE: $\langle s_i s_j \rangle$ is the expected value
$\displaystyle\frac{\partial log p(v) }{ \partial w_{ij}} = \langle s_i s_j \rangle _{\textbf{v}} - \langle s_i s_j \rangle _{model} $
The "positive" phase holds v constant and finds hidden node configurations that lowers the overall energy. This is done for every data vector in the training set. The "negative" phase randomoizes all nodes and finds best competing configurations and raises that configuration's energy. Inefficient, but helpful to collect the stats required for learning.
No backprop is needed in this model because the process of moving towards thermal equilibrium propagates information about the weights.
A specialized implementation of Boltzmann Machines with only one layer of hidden units and no connections between the hidden units.
This lecture was primarily focused on CD learning.
Use case of collaborative filtering was used:
Matrix factorization works well and has a certain error profile. RBMs work about as well but have a different error profile. In an ensemble, these models are combined and work very well together.
Users, Movies and Ratings (softmax output). Each user has rated N movies, and probably a different set of N movies. Instead of a global RBM, each user is modeled with their own RBM, with weights shared among all users. Movie-rating weights for rated movies are shared.
A directed acyclic graph composed of binary stochastic neurons. Like an RBM, it's a generative model where learning is focused on generating configurations that produce the inputs. Unlike an RBM, it's not energy based; it's causal.
Learning is focused on maximizing the log probability that the (binary) training vector is produced from its ancestors. The math is based on inference algos that were developed to explain graphical models.
$p_i \equiv p(s_i = 1) = \frac{1}{1 + \exp(-b_i - \displaystyle\sum_j{s_j w_{ji}})} $
$\Delta w_{ji} = \epsilon * s_j (s_i - p_i)$
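A sketch of that update for a single unit given sampled binary states of its parents (the names are mine):

```python
import numpy as np

def sbn_update(s_parents, s_i, b_i, w_ji, epsilon):
    p_i = 1.0 / (1.0 + np.exp(-b_i - s_parents @ w_ji))  # p(s_i = 1 | parents)
    return w_ji + epsilon * s_parents * (s_i - p_i)      # delta w_ji = eps * s_j (s_i - p_i)
```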
Complicated learning routine because one must sample from a posterior that is not factorial (it's not factorial because hidden nodes that are independent become dependent when connected in the NN and are both used to influence the outcome of the visible node). All weights interact: the posterior depends on the prior plus the likelihood of the parent layers.
A generative graphical model that combines unsupervised learning (to act as feature detectors) with supervised learning (to classify).
Learning happens by linking unsupervised NNs (RBMs etc.) together, applying contrastive wake-sleep (or CD, according to Wikipedia) at each step going "up" the network (generation), and then backprop on the way down (discrimination).
A means of doing PCA-type work in NN architectures.
An example is stacking 4 RBMs from a set of inputs: i => W_1 => h_1 => W_2 => h_2 => W_3 => h_3 => W_4 => stack of n linear units (the code) => W_4^T => h_5 => W_3^T => h_6 => W_2^T => h_7 => W_1^T => output (roll and unroll the network)
A NN is trained to reproduce its input as its output, but the middle hidden layer is a much lower-dimensional array of units. That small number of units becomes a good way to compare different things (like documents and images).
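A toy sketch of the roll/unroll idea with tied (transposed) weights: squash through the encoder, keep the small code layer linear, and decode back with the transposes in reverse order (my own construction, under the 4-RBM stacking above):

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

def autoencode(x, weights):
    h = x
    for W in weights[:-1]:
        h = sig(h @ W)               # roll: squash down through W_1 .. W_3
    code = h @ weights[-1]           # small linear code layer (W_4)
    h = code
    for W in reversed(weights):
        h = sig(h @ W.T)             # unroll: decode with the transposes
    return code, h                   # h approximates the input x
```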